Deep learning for terahertz image denoising in nondestructive historical document analysis

Historical documents contain essential information about the past, including places, people, and events. Many of these valuable cultural artifacts cannot be examined further because aging or external influences have left them too fragile to be opened or turned over, so their rich contents remain hidden. Terahertz (THz) imaging is a nondestructive 3D imaging technique that can reveal the hidden contents without damaging the documents. As noise and imaging artifacts are predominant in images reconstructed by standard THz reconstruction algorithms, this work aims to improve THz image quality with deep learning. To overcome the data scarcity problem in training a supervised deep learning model, an unsupervised deep learning network (CycleGAN) is first applied to generate paired noisy THz images from clean images (the clean images are produced by a handwriting generator). With such synthetic noisy-to-clean paired images, a supervised deep learning model using Pix2pixGAN is trained, which is effective in enhancing real noisy THz images. After Pix2pixGAN denoising, 99% of the characters written on one side of the Xuan paper can be clearly recognized, while 61% of the characters written on one side of the standard paper are sufficiently recognized. The average perceptual index of the Pix2pixGAN-processed images is 16.83, which is very close to the average perceptual index of 16.19 for the clean handwriting images. Our work has important value for THz-imaging-based nondestructive historical document analysis.

www.nature.com/scientificreports/

THz imaging typically has an image resolution of a few hundred microns 17, which is much lower than that of X-ray and optical imaging but is still sufficient for historical document analysis. Although THz imaging is challenging for scanning thick books, it is promising for extracting information from documents consisting of a few paper layers, such as letters and papyrus scrolls 15. Despite its clear advantages, THz imaging requires a trade-off between image quality and imaging speed 18. THz images typically suffer from speckle noise 19, especially in a fast imaging mode. Therefore, THz image denoising has important value in practical applications. Various conventional algorithms have been applied to THz image enhancement, such as adaptive filtering [20][21][22] and deconvolution methods [23][24][25]. Adaptive filtering removes high-frequency noise while preserving the sharpness of edges. Deconvolution methods enhance THz image resolution and suppress noise based on accurate modelling of the point spread function 23. Compressed sensing techniques have also been widely investigated in THz image reconstruction 18,[26][27][28][29]. As compressed sensing is able to reconstruct images from relatively few measurements by exploiting sparsity, it has been demonstrated to be effective for high-speed THz imaging, such as single-pixel THz imaging systems 28,29. For example, Li et al. 18 proposed combining the ant colony algorithm with a compressive sensing technique based on the local Fourier transform, which reduces noise well while preserving edge information.
Recently, deep learning has achieved impressive results in various fields, including THz imaging 30 . Deep learning has been applied to segmentation and classification tasks in THz images such as impurity detection in wheat 31,32 , breast cancer classification 33 , and heavy-metal detection in soils 34 . The low resolution problem of THz imaging can also be mitigated by deep learning based super-resolution techniques 35,36 . In rapid THz imaging, deep learning can significantly reduce algorithm complexity and increase signal-to-noise ratio [37][38][39][40][41][42] . For example, Ljubenović et al. 37 used a convolutional neural network (CNN) for THz image deblurring and their work demonstrates the efficacy of CNNs for denoising on synthetic THz data. Choi et al. 42 adopted the WaveNet from the field of speech and audio for THz image denoising in the frequency domain for 1D temporal signals. To overcome limited training data, Jiao et al. 43 proposed a Noise2Noise-based network for THz spectrum denoising using transfer learning from low-quality underwater images. However, deep learning has not been investigated in THz imaging for historical document analysis yet.
The paper aims to improve THz image quality for historical document analysis by reducing imaging noise and artifacts, which commonly exist in reconstructed images processed by standard THz reconstruction algorithms. Our work demonstrates the feasibility of THz imaging in information retrieval from sealed envelopes. It also demonstrates the efficacy of deep learning for THz image enhancement for better character recognition. To the best of our knowledge, our work is the first to apply deep learning to THz image enhancement for historical document analysis. Our experiments indicate that the deep learning enhanced image quality relies on the paper type and the page sides, which is valuable information conveyed to the community. From our point of view, our work is a very important step towards real applications of THz imaging in nondestructive document analysis, which will encourage more research in this topic.

Materials
The THz images used in this work were acquired at the Institute of Microwaves and Photonics (LHFT), Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany. For the measurements, the commercial radar imaging system "Quality Automotive Radome Tester" by Rohde and Schwarz was used. It is a multiple-input multiple-output (MIMO) radar consisting of 3 × 4 sparse subarrays with 1128 transmit channels and 1128 receive channels in total. The applied signal is a 64-point stepped-frequency continuous-wave signal in the range of 74 GHz to 79 GHz. More details of the scanner can be found here (https://www.rohde-schwarz.com/us/product/qar).
To mimic historic letters concealed in envelopes, two types of paper were used to create the images for the dataset. One dataset was made with standard A4 paper and the other with A4 Xuan paper. Xuan paper features great tensile strength, a smooth surface, a pure and clean texture, clean strokes, and excellent resistance to corrosion, moths, and mold. Xuan paper is thinner than standard paper, and hence the corresponding Xuan-paper THz images have less noise than the standard-paper THz images. In addition, the papers were written in two ways: some on a single side and others on both sides. Therefore, the two-side written images contain overlapping letters. All the letters were written with the calligraphy ink Type 29770 from Rohrer & Klinger Company. For each letter, a 3-D volume was reconstructed with a size of 705 × 1025 × 97 voxels and an anisotropic voxel spacing of 0.5 × 0.5 × 0.573 mm³. To reduce the effect of paper wrinkles and tilt, maximum intensity projection along the Z-direction was used to convert the 3-D volumes to 2-D images. Two THz image examples from the standard paper and the Xuan paper are displayed in Fig. 1a,b, respectively. The THz signal is emitted and received by a vector network analyzer (VNA) (Rohde & Schwarz ZVA 24) combined with frequency extenders (Rohde & Schwarz ZVA-Z325) for the range between 220 and 325 GHz 15. Two spline horn antennas and two polyethylene dielectric lenses were also used to achieve optimal focusing.
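The maximum intensity projection described above can be illustrated with a minimal Python sketch. The axis ordering `volume[z][y][x]`, the function name, and the toy values are assumptions made for illustration only; the actual reconstruction pipeline is not described at this level of detail.

```python
def max_intensity_projection(volume):
    """Collapse a 3-D volume indexed as volume[z][y][x] into a 2-D image
    by keeping the largest value along z for every (y, x) position."""
    depth = len(volume)
    height = len(volume[0])
    width = len(volume[0][0])
    return [
        [max(volume[z][y][x] for z in range(depth)) for x in range(width)]
        for y in range(height)
    ]


# Toy volume with 3 z-slices of 2 x 2 pixels (hypothetical values);
# the reconstructed volumes in this work are far larger.
toy_volume = [
    [[0.1, 0.2], [0.0, 0.4]],
    [[0.5, 0.1], [0.3, 0.2]],
    [[0.2, 0.9], [0.1, 0.0]],
]
projection = max_intensity_projection(toy_volume)
```

Because each output pixel keeps only the strongest response along Z, a slightly tilted or wrinkled page still contributes its ink signal to the 2-D image regardless of the depth at which it was recorded.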

Methods
As displayed in Fig. 1, the acquired THz images suffer from severe noise, which is an obstacle to the recognition of context in historic document analysis. This work aims to enhance THz images using deep learning methods. Since THz image acquisition is expensive and time-consuming, it is challenging to acquire paired clean and noisy THz images to train a supervised deep learning model. To overcome the lack of paired data, we propose to apply an unsupervised learning network, in particular CycleGAN, to generate paired images using unpaired synthetic clean images and real noisy images. The synthetic clean images are generated by a handwriting generator, and a learned CycleGAN model adds similar noise patterns to the synthetic clean images to construct clean and noisy image pairs. With such paired images, a supervised learning network, in particular Pix2pixGAN, is applied for the final THz image denoising.

A handwriting generator 44 was employed to generate clean handwriting images. A black background was taken, and random letters in white were created over it using random fonts. As the result of this first step, 2000 clean images in total were created to train our models. The outputs of the handwriting generator are binary images of letters with different font types. They are saved in 8-bit PNG format. Figure 2 displays two exemplary images generated by the handwriting generator with two different fonts.

Synthesis of paired data via CycleGAN using unpaired data. Conversion between clean and noisy images is fundamentally an image-to-image translation task. Since only unpaired, rather than paired, synthetic clean images and real noisy THz images are available, CycleGAN 45 is applied for such unpaired image-to-image translation in this work. CycleGAN consists of two generators: $G_{AB}$, which transfers an image from domain A to domain B, and $G_{BA}$, which transfers an image from domain B to domain A. In our work, domain A contains clean text images generated by the handwriting generator and domain B contains images with THz imaging noise and artifacts. Two discriminators, $D_A$ and $D_B$, distinguish whether an image belongs to their respective domain. For the pair $G_{AB}$ and $D_B$, the adversarial loss function is defined as

$$L_{GAN}(G_{AB}, D_B) = \mathbb{E}_{b \sim p(b)}\left[\log D_B(b)\right] + \mathbb{E}_{a \sim p(a)}\left[\log\left(1 - D_B(G_{AB}(a))\right)\right], \tag{1}$$

where $a$ and $b$ denote images drawn from domains A and B, respectively. Similarly, the adversarial loss for $G_{BA}$ and $D_A$ is defined as $L_{GAN}(G_{BA}, D_A)$. In addition, a cycle-consistency loss is applied to minimize the reconstruction error after an image translated from one domain to the other is translated back to the original domain,

$$L_{cyc}(G_{AB}, G_{BA}) = \mathbb{E}_{a \sim p(a)}\left[\lVert G_{BA}(G_{AB}(a)) - a \rVert_1\right] + \mathbb{E}_{b \sim p(b)}\left[\lVert G_{AB}(G_{BA}(b)) - b \rVert_1\right]. \tag{2}$$

The overall objective function is

$$L(G_{AB}, G_{BA}, D_A, D_B) = L_{GAN}(G_{AB}, D_B) + L_{GAN}(G_{BA}, D_A) + \lambda_{cyc} L_{cyc}(G_{AB}, G_{BA}), \tag{3}$$

where $\lambda_{cyc}$ is the weight for the cycle-consistency loss. In our work, during training we kept the clean synthetic images created via our handwriting generator in one domain and the collected real THz images in the other domain, as displayed in Fig. 3a. During inference, the clean synthetic images are reused as the input test data, and CycleGAN outputs their corresponding paired noisy images, which share similar noise characteristics with the real noisy THz images. Note that during inference, the real noisy THz images can be used as the input data as well, in which case CycleGAN outputs their corresponding denoised images. In this work, such direct denoising by CycleGAN is also investigated.

Figure 1. Examples of THz images from one-side standard paper (a), one-side Xuan paper (b), two-side standard paper (c) and two-side Xuan paper (d).
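To make the interplay of the adversarial and cycle-consistency terms concrete, the following is a minimal Python sketch. It assumes the standard log-likelihood form of the CycleGAN objective; the toy generators and discriminator (`add_offset`, `remove_offset`, `remove_too_much`, `mean_score`) are hypothetical stand-ins for neural networks and are not part of the method. Note also that practical implementations may replace the logarithmic adversarial term with a least-squares variant.

```python
import math


def l1_distance(image_a, image_b):
    """Mean absolute difference between two equally sized images (flat lists)."""
    return sum(abs(p - q) for p, q in zip(image_a, image_b)) / len(image_a)


def adversarial_loss(discriminator, generator, sources, targets, eps=1e-12):
    """E_t[log D(t)] + E_s[log(1 - D(G(s)))] for one generator/discriminator pair."""
    real_term = sum(math.log(discriminator(t) + eps) for t in targets) / len(targets)
    fake_term = sum(
        math.log(1.0 - discriminator(generator(s)) + eps) for s in sources
    ) / len(sources)
    return real_term + fake_term


def cycle_consistency_loss(g_ab, g_ba, batch_a, batch_b):
    """L1 error after translating A -> B -> A plus the error after B -> A -> B."""
    forward = sum(l1_distance(g_ba(g_ab(a)), a) for a in batch_a) / len(batch_a)
    backward = sum(l1_distance(g_ab(g_ba(b)), b) for b in batch_b) / len(batch_b)
    return forward + backward


def cyclegan_objective(g_ab, g_ba, d_a, d_b, batch_a, batch_b, lambda_cyc=0.5):
    """Both adversarial terms plus the weighted cycle-consistency term."""
    return (
        adversarial_loss(d_b, g_ab, batch_a, batch_b)
        + adversarial_loss(d_a, g_ba, batch_b, batch_a)
        + lambda_cyc * cycle_consistency_loss(g_ab, g_ba, batch_a, batch_b)
    )


# Hypothetical stand-ins: a constant offset plays the role of "THz noise".
def add_offset(image):          # stand-in for G_AB (clean -> noisy)
    return [v + 0.25 for v in image]


def remove_offset(image):       # stand-in for G_BA (noisy -> clean)
    return [v - 0.25 for v in image]


def remove_too_much(image):     # a G_BA that does not invert G_AB
    return [v - 0.5 for v in image]


def mean_score(image):          # stand-in discriminator with output in [0, 1]
    return min(max(sum(image) / len(image), 0.0), 1.0)


clean_batch = [[0.0, 0.5, 0.0, 0.5]]
noisy_batch = [[0.25, 0.75, 0.25, 0.75]]

consistent = cycle_consistency_loss(add_offset, remove_offset, clean_batch, noisy_batch)
inconsistent = cycle_consistency_loss(add_offset, remove_too_much, clean_batch, noisy_batch)
total = cyclegan_objective(
    add_offset, remove_offset, mean_score, mean_score, clean_batch, noisy_batch
)
```

In this toy setting the cycle-consistency term vanishes only when the two generators invert each other, which is the property that allows the synthetic noisy image to stay aligned with its clean source.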
Image denoising using Pix2pixGAN. In this work, Pix2pixGAN 46 is applied to translate noisy THz images to denoised ones with paired data. Pix2pixGAN is a conditional GAN, which uses the U-Net as the generator, G, and a 5-layer patch-wise convolutional classifier as the discriminator, D. G learns to convert noisy THz images into clean ones. D learns to distinguish the output denoised images from the reference clean images. The objective of the conditional GAN is

$$L_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right] + \mathbb{E}_{x}\left[\log\left(1 - D(x, G(x))\right)\right], \tag{4}$$

where $x$ is the input, $y$ is the target, and G tries to minimize this objective against an adversarial D that tries to maximize it, i.e., $G^* = \arg\min_G \max_D L_{cGAN}(G, D)$. In addition, an $\ell_1$ loss function is applied to train the generator's output to be close to the target with less blurring compared to the $\ell_2$ loss,

$$L_{\ell_1}(G) = \mathbb{E}_{x,y}\left[\lVert y - G(x) \rVert_1\right]. \tag{5}$$

The overall objective function is

$$G^* = \arg\min_G \max_D L_{cGAN}(G, D) + \lambda L_{\ell_1}(G), \tag{6}$$

where $\lambda$ is the weight for the $\ell_1$ loss. As displayed in Fig. 3b, during training the synthetic noisy images from CycleGAN are used as the input and the corresponding clean images from the handwriting generator are used as the target. Only synthetic images are used for training. During inference, the real noisy THz images are used as the input and Pix2pixGAN predicts their corresponding denoised versions.
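The generator side of the Pix2pixGAN objective described above can be sketched as follows. This is a simplified illustration under stated assumptions: the discriminator is reduced to a single probability score (`disc_score_on_fake`, a hypothetical name), images are flat lists, and the default weight of 100 follows the value reported in the experimental setup. Practical implementations often use a non-saturating adversarial term instead.

```python
import math


def l1_loss(prediction, target):
    """Mean absolute error between the generator output and the clean target."""
    return sum(abs(p - t) for p, t in zip(prediction, target)) / len(target)


def generator_objective(disc_score_on_fake, prediction, target,
                        lambda_l1=100.0, eps=1e-12):
    """Quantity the generator minimizes: log(1 - D(x, G(x))) + lambda * L1."""
    adversarial = math.log(1.0 - disc_score_on_fake + eps)
    return adversarial + lambda_l1 * l1_loss(prediction, target)


clean_target = [0.0, 1.0]

# A perfect prediction leaves only the adversarial term.
perfect = generator_objective(0.5, [0.0, 1.0], clean_target)

# A prediction that is off by 0.01 per pixel adds 100 * 0.01 = 1.0.
slightly_off = generator_objective(0.5, [0.01, 0.99], clean_target)
```

With a weight of 100, even a small per-pixel deviation dominates the adversarial term, which pushes the generator towards outputs that stay close to the clean target.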

Experimental setup
Training data synthesis using CycleGAN. The synthetic dataset was created using CycleGAN. For this experiment, the code from Jun-Yan Zhu et al. 45, available on GitHub, was adopted. The discriminator is a PatchGAN with a patch size of 70 × 70, and the generator is a 9-layer ResNet. The dataset consisted of two domains: clean synthetic images created by the handwriting generator and the original THz images. The model was trained using an Adam optimizer with a batch size of 2 for 200 epochs with an initial learning rate of 0.0002 to generate 2000 noisy synthetic images similar to the original THz images. The weight $\lambda_{cyc}$ for the cycle-consistency loss was set to 0.5. For the generator, no dropout was applied. The input channel and output channel were both set to 1. The learning rate was kept the same for the first 100 epochs and linearly decayed to zero over the following 100 epochs. All the images were resized and cropped to 256 × 256 during data preprocessing, and no data augmentation was used. The rest of the parameters were kept unchanged with respect to 45.

Image denoising using Pix2pixGAN. The U-Net is used as the Pix2pixGAN generator, which contains 8 down-sampling modules as well as 8 skip connections. For more details, please refer to the "unet-256" configuration in the authors' implementation 46. An Adam optimizer was used to train the model with a batch size of 5 for 200 epochs with a constant learning rate of 0.0002. The weight for the ℓ1 loss was set to 100. The model was trained with the 2000 paired noisy synthetic THz images created using CycleGAN, and the inference dataset consisted of the 34 original THz images. A validation dataset of 30 paired noisy synthetic THz images was used to monitor overfitting. The training and validation ℓ1 losses of the generator are displayed in Fig. 4, where no obvious overfitting occurs.
As proposed in the paper 46 , random jitter was applied by resizing the 256 × 256 input images to 286 × 286 and then randomly cropping them back to size 256 × 256 . The model weights were initialized following a Gaussian distribution with zero mean and standard deviation 0.02. The remaining parameters were kept the same as the standard version 46 .
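Two of the training details in this setup, the CycleGAN learning-rate schedule and the random cropping step of the jitter, can be sketched in a few lines of Python. The function names, the 0-based epoch convention, and the exact decay formula are assumptions for illustration; the referenced implementations may differ in such details, and the interpolation used to resize the images to 286 × 286 is omitted here.

```python
import random


def cyclegan_learning_rate(epoch, base_lr=0.0002,
                           constant_epochs=100, decay_epochs=100):
    """Learning rate for a 0-based epoch index: constant first,
    then a linear decay that reaches zero after the last epoch."""
    if epoch < constant_epochs:
        return base_lr
    remaining = constant_epochs + decay_epochs - epoch
    return base_lr * max(remaining, 0) / decay_epochs


def random_crop(image, crop_size, rng):
    """Cut a crop_size x crop_size window at a random position
    from a 2-D image stored as a list of rows."""
    top = rng.randint(0, len(image) - crop_size)
    left = rng.randint(0, len(image[0]) - crop_size)
    return [row[left:left + crop_size] for row in image[top:top + crop_size]]


# An image assumed to be already resized to 286 x 286 (placeholder zeros).
resized = [[0] * 286 for _ in range(286)]
jittered = random_crop(resized, 256, random.Random(0))

schedule = [cyclegan_learning_rate(e) for e in (0, 99, 150, 200)]
```

The random crop shifts the content by up to 30 pixels in each direction, which acts as a mild translation augmentation while keeping the network input size fixed at 256 × 256.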
Comparison algorithms. In this work, some exemplary results of other algorithms are also displayed as a comparison. The bilateral filter 47 and its trainable version 48 are applied to compare with well-known adaptive filters. In particular, trainable bilateral filter versions have been shown to provide robust denoising performance in the context of medical imaging 49. The iterative reweighted total variation (wTV) algorithm 50 is selected as a compressed sensing representative. The half instance normalization network (HINet) 51 is chosen as a general deep learning denoising representative. Self-supervised learning algorithms do not rely on labelled training data, which can avoid the data scarcity problem. In this work, three self-supervised learning algorithms are selected: Noise2Self 52, Noise2Void and the self-supervised vision transformer (SiT) 53. SiT applies the latest techniques of transformers. Noise2Self and Noise2Void are well-known self-supervised denoising algorithms. In our experiments, three trainable bilateral filter layers are trained in a self-supervised way using the Noise2Void method, following the setup of Wagner et al. 48.
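As an illustration of why a bilateral filter preserves edges while smoothing noise, the following is a minimal one-dimensional Python sketch. It is not the implementation used in the comparison; the function name, the parameter values, and the toy signal are assumptions chosen only to show the principle of combining a spatial weight with an intensity-similarity weight.

```python
import math


def bilateral_filter_1d(signal, radius=2, sigma_space=1.0, sigma_range=0.1):
    """Edge-preserving smoothing: each neighbour is weighted by its spatial
    distance and by how similar its intensity is to the centre sample."""
    output = []
    for i, center in enumerate(signal):
        weighted_sum = 0.0
        weight_total = 0.0
        for j in range(max(0, i - radius), min(len(signal), i + radius + 1)):
            spatial = math.exp(-((i - j) ** 2) / (2 * sigma_space ** 2))
            similarity = math.exp(
                -((signal[j] - center) ** 2) / (2 * sigma_range ** 2)
            )
            weight = spatial * similarity
            weighted_sum += weight * signal[j]
            weight_total += weight
        output.append(weighted_sum / weight_total)
    return output


# A noisy step edge (hypothetical values): background near 0, stroke near 1.
noisy_step = [0.02, -0.01, 0.03, 0.0, 1.01, 0.98, 1.02, 1.0]
smoothed = bilateral_filter_1d(noisy_step)
```

Samples on the opposite side of the step receive a negligible similarity weight, so the edge between background and stroke stays sharp while small fluctuations on either side are averaged out.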
Evaluation metrics. Since ground truth images are not available for the CycleGAN synthetic images and the denoised real THz images, a non-reference image quality metric called the perceptual index (PI) 54 is used to quantify these images. The perceptual index is calculated from two non-reference metrics, the natural image quality evaluator (NIQE) 55 and Ma's score 56, both of which extract image features to compute the perceptual quality. For super-resolution tasks on natural images, a lower PI value corresponds to richer fine structures and hence indicates better perceptual quality. In our application, a lower PI value in general corresponds to more high-frequency noise and artifacts. The average PI value of all the original noisy THz images is 6.85 with a standard deviation of 0.60, while that of the clean handwriting generator images is 16.19 with a standard deviation of 0.45. Therefore, larger PI scores are desired for our denoising results. In addition, a custom approach is applied to quantify the algorithms used to denoise the THz images. As this paper aims to reduce the noise of THz images and ultimately retrieve the original data, or at least its structure, a character that is visible to the naked eye is counted as a success, whereas a character that is missing, entirely or in part, is not counted as a valid output. The same accuracy calculation is followed in the case of overlapping characters. Because it is impossible to identify overlapped characters separately in two-side written images, two overlapped characters count as a single structure. The correct retrieval of the structure of overlapped characters is counted as a success. The results are differentiated by the type of paper.
The accuracy is measured according to Eq. (8), and a comparative result is displayed in Table 2 for the Xuan-Paper and standard paper.
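The two metrics can be summarized in a short Python sketch. The perceptual index below follows its commonly used definition, PI = ½((10 − Ma) + NIQE); the accuracy function assumes that the referenced equation is the percentage of correctly recognized characters among all written characters, which is an interpretation of the description above rather than a quotation. The input values are hypothetical.

```python
def perceptual_index(niqe, ma_score):
    """Perceptual index from the two non-reference metrics:
    PI = 0.5 * ((10 - Ma) + NIQE)."""
    return 0.5 * ((10.0 - ma_score) + niqe)


def recognition_accuracy(recognized, total):
    """Percentage of characters (or overlapped structures) counted as a success."""
    if total <= 0:
        raise ValueError("total number of characters must be positive")
    return 100.0 * recognized / total


# Hypothetical metric values and character counts, for illustration only.
example_pi = perceptual_index(niqe=20.0, ma_score=3.0)
example_accuracy = recognition_accuracy(recognized=61, total=100)
```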

Results
CycleGAN results. One exemplary synthetic image from CycleGAN is displayed in Fig. 5c together with its corresponding clean input image (Fig. 5b) and a real THz image (Fig. 5a). Figure 5a,c have a similar appearance, although the two characters indicated by the arrows are hardly visible. The histograms of Fig. 5a,c are displayed in Fig. 5d, which indicates that the synthetic image also has an intensity distribution similar to that of the real THz image. The average mean intensity, average standard deviation, and average total variation (TV) values for all the real and synthetic images are displayed in Table 1. For all the synthetic images, the average perceptual index is 4.52 with a standard deviation of 0.83. To show the overall appearance of the synthetic images, four additional synthetic images together with their PI values are displayed in Fig. 5e-h. Figure 5e is a typical example of the CycleGAN synthetic images, like Fig. 5c. Figure 5f-h have slightly different appearances: Fig. 5f contains high-intensity artifacts surrounding each character; Fig. 5g contains wrinkle-like structures in the background; Fig. 5h is very bright for both characters and artifacts.
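The per-image statistics compared between real and synthetic images (mean intensity, standard deviation, and total variation) can be computed as in the following Python sketch. The anisotropic form of the total variation (sum of absolute horizontal and vertical differences) is an assumption, since the exact TV definition is not stated; the toy images are hypothetical.

```python
import math


def image_statistics(image):
    """Return (mean, standard deviation, anisotropic total variation)
    of a 2-D image stored as a list of rows."""
    pixels = [value for row in image for value in row]
    mean = sum(pixels) / len(pixels)
    std = math.sqrt(sum((value - mean) ** 2 for value in pixels) / len(pixels))
    horizontal = sum(
        abs(row[x + 1] - row[x]) for row in image for x in range(len(row) - 1)
    )
    vertical = sum(
        abs(image[y + 1][x] - image[y][x])
        for y in range(len(image) - 1)
        for x in range(len(image[0]))
    )
    return mean, std, horizontal + vertical


flat_image = [[0.5, 0.5], [0.5, 0.5]]      # no structure, no noise
checkerboard = [[0.0, 1.0], [1.0, 0.0]]    # maximal pixel-to-pixel change

flat_stats = image_statistics(flat_image)
checker_stats = image_statistics(checkerboard)
```

Both toy images share the same mean intensity, but the total variation separates them clearly, which is why TV is a useful complement to the histogram when comparing noise levels.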
Two exemplary CycleGAN prediction results using real noisy THz images as the input are displayed in Fig. 6. In Fig. 6b,e, although noise is reduced, many fragments of the characters are removed or random strokes are added. Hence, only a small portion of the characters can be recognized. For example, in Fig. 6b only the characters "C", "D", "N", "P" and "S" are correctly restored, and in Fig. 6e only the characters "D", "G", "R", "N" and "S" are correctly restored. Figure 6 indicates that directly using CycleGAN for THz image denoising is insufficient.

Pix2pixGAN results. For the Xuan-paper input (Fig. 6a), the Pix2pixGAN output is entirely noiseless and all the characters in this image can be well recognized, as shown in Fig. 6c. The result for the standard-paper input in Fig. 6f is noise-free as well. Due to the relatively high noise level in THz images of standard paper, some parts of certain characters are missing in Fig. 6f, for example, in the letters "E" and "Z". Nevertheless, other characters like "C" and "S" are well recognized.
Two exemplary results of Pix2pixGAN on two-sided written THz images are displayed in Fig. 7. For both Xuan and standard paper, noise (artifacts) is removed, although some residual artifacts remain in the background. Compared with characters written on the back side, those on the front side are recognized much better. Nevertheless, the interpreted letter "G" in Fig. 7b is actually either "Q" or "O" in Fig. 7a, while the letter "C" in Fig. 7d is actually a mixture of two letters in the input image Fig. 7c.
The results of the comparison algorithms on the same THz image written on Xuan paper (Fig. 6a) are displayed in Fig. 8. Figure 8a demonstrates that a bilateral filter with hand-picked filter parameters can reduce the noise and image artifacts to some degree, but the resultant background appears blurry. In Fig. 8b, the noise and artifacts are reduced as well. However, some "shadow" artifacts remain. The HINet result in Fig. 8c has the best binarization performance, although some artifacts remain. Like the bilateral filter and wTV, HINet is able to improve the image quality, but many fragments of the characters are missing. The self-supervised learning algorithms all fail to reduce noise or artifacts, as displayed in Fig. 8d-f. Therefore, they are excluded from further quantification in Table 2. The character recognition accuracy in Table 2 indicates that almost all the characters (99%) in the Pix2pixGAN results can be recognized for Xuan paper, while ≤ 50% of the characters are recognized in the results of the other algorithms. For standard paper, only 61% of the characters are recognized in the Pix2pixGAN results, but this is still higher than the accuracies of the other algorithms. The PI scores of the bilateral filtering and wTV results are smaller than the average PI (16.19) of the clean handwriting generator images, which indicates that noise and artifacts remain in such images. In contrast, the PI scores of HINet are larger than 16.19, which indicates good binarization of its results. However, the missing fragments in its processed images result in sparser image features, which

Discussion
In the ideal case, CycleGAN should be able to convert clean images into noisy ones and, conversely, noisy images into clean ones. In our work, Fig. 5 demonstrates that CycleGAN is able to generate realistic noisy images from clean images generated by a handwriting generator. However, it is not able to generate satisfactory denoised images directly from real noisy THz images, as shown in Fig. 6. As we observed, CycleGAN does a better job in translating clean images to noisy ones than in translating noisy images to clean ones. This could be explained using the concept of entropy: obtaining noisy images, which have higher entropy, is easier than obtaining clean images, which have lower entropy. Therefore, CycleGAN is first applied to generate the paired noisy images of the clean handwriting images, and then an additional supervised learning network trained on such paired data is applied to obtain the final denoised images.

Data scarcity is a common problem for deep learning applications. Synthetic data is nowadays commonly generated for training deep learning models in various fields 57,58, and models trained in this way have been demonstrated to generalize well to real data. The results in this work demonstrate that using synthetic data for training supervised deep learning models is also effective for THz image denoising. This encourages further deep learning based THz applications.

Figures 6 and 7 reveal which types of historical documents are suitable for context retrieval by THz imaging: (a) Fig. 6c demonstrates the efficacy of Pix2pixGAN in THz image denoising for one-sided Xuan paper; (b) Fig. 6f indicates that THz imaging with deep learning denoising has the potential to reveal most information written on a single-page standard paper; (c) Fig. 7 indicates that character recognition in THz images of documents with double-sided text is very challenging, regardless of whether Xuan or standard paper is used.
In the real THz images, not only high-frequency noise but also image artifacts with high-intensity block-like structures exist. Conventional denoising algorithms such as the (trainable) bilateral filter and wTV are effective in reducing high-frequency noise. However, they are not optimal for removing structured artifacts. HINet is also a supervised learning network and uses the same training data as Pix2pixGAN. It learns to binarize the real THz images from synthetic training data. However, due to the limited representation power of its architecture (designed for denoising only), it is not able to restore missing fragments of the characters. Self-supervised learning networks such as Noise2Self and Noise2Void rely on local noise characteristics, such as J-invariance 52. Therefore, such networks are optimized to remove random noise based on local neighbourhoods but are not suitable for block-like structured artifacts. To develop effective self-supervised learning algorithms for such THz images, further research is required.
Some characters written on one-side standard paper are ambiguous after Pix2pixGAN denoising, for example, the letters "E", "F" and "G" in Fig. 6f. In our experiments, only individual characters, instead of words or sentences, are written on the pages, which increases ambiguity once any character is missing fragments. For words and sentences, such ambiguity can potentially be reduced based on the surrounding context. In other words, spell correction can be performed to obtain meaningful words and sentences and hence reduce ambiguity. This is one potential advantage of real historical document analysis. To generate synthetic data for training, more sophisticated handwriting styles are available 44,59. However, real historical documents pose many other challenges, for example, blurred handwriting due to aging and imaging shadow artifacts caused by paper wrinkles. Such challenges require future exploration. Nevertheless, this work is an important step towards real nondestructive historical document analysis using THz imaging. In this work, the CycleGAN and Pix2pixGAN models are purely data-driven. Data-driven deep learning models may not generalize well to out-of-distribution test data and are sensitive to noise or perturbations 49,60. Consequently, in our CycleGAN results, some synthetic images have different appearance characteristics (e.g., Fig. 5h), and we exclude these images from the training of Pix2pixGAN. Developing physics-informed neural networks 61 for supervised learning, which are built on known operators 62 and hence can combine the advantages of both deep learning and conventional methods, should be investigated in our future work. Conventional THz imaging theories have the potential to help develop more robust and effective neural networks for THz image enhancement.
For example, the conventional mathematical modelling of the THz point spread function and the simulation of THz imaging systems 23 can guide CycleGAN or a custom-designed network to generate more diverse and realistic THz images 63 for training Pix2pixGAN, which may enable Pix2pixGAN to generalize well to THz images acquired with various system settings.

Conclusion
This work applies deep learning to denoise THz images for nondestructive historical document analysis. To overcome the data scarcity problem when training a supervised deep learning model, an unsupervised learning network, CycleGAN, is first applied to generate paired noisy images from clean synthetic images generated by a handwriting generator. Such synthetic paired data is effective for training Pix2pixGAN for THz image denoising. Our work demonstrates that the deep learning denoising performance, as well as the resultant character recognition accuracy, depends highly on the paper type and on whether one or both sides are written on: context can be easily retrieved from one-side Xuan paper after Pix2pixGAN denoising; most context written on one-side standard paper can still be retrieved using Pix2pixGAN; however, context written on both sides is very challenging to retrieve due to the overlap of characters. This work is an important step towards real THz-imaging-based nondestructive historical document analysis.

Data availability
The datasets generated and/or analyzed during the current study are not publicly available but are available from the corresponding author on reasonable request.